Abstract

Canonical correlation analysis is a technique to extract
common features from a pair of multivariate data. In complex
situations, however, it does not extract useful features
because of its linearity. On the other hand, the kernel method
used in support vector machines is an efficient approach to
improving such a linear method. In this paper, we investigate
the effectiveness of applying the kernel method to canonical
correlation analysis.
1 Introduction

This paper deals with a method to extract common features
from multiple information sources. For instance, let us
consider a learning task in pattern recognition, in which an
object is presented as an image and its name is given as
speech. For a newly given image, the system is required to
answer its name in speech, and for newly given speech, the
system is to answer with the corresponding image. The task
can be considered a regression problem from image to speech
and vice versa. However, since the dimensionalities of images
and speech are generally very large, regression analysis may
not work effectively. To solve the problem, it is useful to
map the inputs into a low dimensional feature space and then
to solve the regression problem.

Canonical correlation analysis (CCA) has been used for such
a purpose. CCA finds a linear transformation of a pair of
multivariates such that the correlation coefficient is
maximized. From an information theoretical point of view, the
transformation maximizes the mutual information between the
extracted features. However, if there is a nonlinear relation
between the variates, CCA does not always extract useful
features.

On the other hand, support vector machines (SVM) have
attracted a lot of attention for their state-of-the-art
performance in pattern recognition [8]. The kernel trick used
in SVM is applicable not only to classification but also to
other linear techniques, for example, kernel regression and
kernel PCA [6].

In this paper, we apply the kernel method to 
CCA. Since the kernel method is likely to overfit the 
data, we incorporate some regularization technique 
to avoid the overfitting. 



2 Canonical correlation analysis

CCA was proposed by Hotelling in 1935 [3]. Suppose there is
a pair of multivariates $x \in \mathbb{R}^{n_x}$,
$y \in \mathbb{R}^{n_y}$. CCA finds a pair of linear
transformations such that the correlation coefficient between
the extracted features is maximized (Fig. 1).

Figure 1: CCA

For the sake of simplicity, we assume that the averages of x
and y are 0 and that the dimensionality of the feature is 1.
Then, with the transformations

$$u = \langle a, x \rangle, \tag{1}$$
$$v = \langle b, y \rangle, \tag{2}$$

we would like to find the transformation a, b that maximizes

$$\rho = \frac{E[uv]}{\sqrt{\mathrm{Var}[u]\,\mathrm{Var}[v]}}, \tag{3}$$

where $\langle a, x \rangle$ represents the inner product. We
further assume

$$\mathrm{Var}[u] = \mathrm{Var}[v] = 1 \tag{4}$$

to remove the freedom of scaling of u and v. a and b are given
by the eigenvectors corresponding to the maximal eigenvalue of
a generalized eigenvalue problem. If we need more than one
dimension, we can take the eigenvectors corresponding to the
next largest eigenvalues.

CCA is important from an information theoretical viewpoint,
since it finds a transformation that maximizes the mutual
information between features when x and y are jointly
Gaussian. Even if the assumption is not fulfilled, CCA can
still be used in some cases. However, if the purpose is
regression, large values of the correlation coefficients are
crucial. Small correlation coefficients can arise in the
following cases:

1. x and y have almost no relation.

2. There is a strong nonlinear relation between x and y.

No improvement is possible in the first case. In the second
case, however, we can capture the relation by some methods.
One of them is to allow nonlinear transformations, and Asoh
et al. [4] have proposed a neural network model that
approximates the optimal nonlinear canonical correlation
analysis. However, this model requires a lot of computation
time and also has many local optima. In this paper, we
incorporate the kernel method, which enables nonlinear
transformations with a small amount of computation and no
undesired local optima.

3 Kernel CCA

Figure 2: Kernel CCA

First, x and y are transformed into Hilbert spaces,
$\phi_x(x) \in H_x$ and $\phi_y(y) \in H_y$. By taking inner
products with parameters in the Hilbert spaces, $a \in H_x$
and $b \in H_y$, we find features

$$u = \langle a, \phi_x(x) \rangle, \tag{5}$$
$$v = \langle b, \phi_y(y) \rangle, \tag{6}$$

which maximize the correlation coefficient.

Now, suppose we have pairs of training samples
$\{(x_i, y_i)\}_{i=1}^{N}$. a and b can be found from the
Lagrangian

$$\mathcal{L}_0 = E[(u - E[u])(v - E[v])] - \frac{\lambda_1}{2} E[(u - E[u])^2] - \frac{\lambda_2}{2} E[(v - E[v])^2]. \tag{7}$$

However, the Lagrangian as it stands is ill-posed when the
dimensionalities of the Hilbert spaces are large. Therefore,
we introduce a quadratic regularization term that penalizes
large norms of a and b, and obtain the well-posed Lagrangian

$$\mathcal{L} = \mathcal{L}_0 - \frac{\eta}{2} \left( \|a\|^2 + \|b\|^2 \right), \tag{8}$$

where $\eta$ is a regularization constant. Note that the
average of u is given by

$$E[u] = \frac{1}{N} \sum_i \langle a, \phi_x(x_i) \rangle, \tag{9}$$

and the average of uv is given by

$$E[uv] = \frac{1}{N} \sum_i \langle a, \phi_x(x_i) \rangle \langle b, \phi_y(y_i) \rangle. \tag{10}$$

Now, from the condition that the derivative of $\mathcal{L}$
with respect to a is equal to 0, we get

$$a = \sum_i \alpha_i \phi_x(x_i), \tag{11}$$

where $\alpha_i$ is a scalar. As a result, we have

$$u = \sum_i \alpha_i \langle \phi_x(x_i), \phi_x(x) \rangle. \tag{12}$$

Therefore, u can be calculated using only inner products in
$H_x$. The kernel trick used in SVM uses a kernel function
$k_x(x_1, x_2)$ instead of the inner product between
$\phi_x(x_1)$ and $\phi_x(x_2)$. In practice, since we do not
need an explicit form of $\phi_x$, we first choose a $k_x$
that can be decomposed into the form of an inner product. By
Mercer's theorem, a symmetric positive definite kernel $k_x$
can be decomposed into the inner product form.
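
The equivalence between (11) and (12) can be checked numerically. The sketch below is illustrative only: it uses a homogeneous degree-2 polynomial kernel on two dimensional inputs, which is an assumption made here because its feature map can be written out explicitly (the simulations in this paper use a Gaussian kernel instead). The names `poly2_kernel` and `poly2_features` are hypothetical helpers, not part of any library.

```python
import numpy as np

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel k(x, z) = <x, z>^2."""
    return float(np.dot(x, z)) ** 2

def poly2_features(x):
    """Explicit feature map phi for 2-D inputs with <phi(x), phi(z)> = <x, z>^2."""
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))    # training inputs x_1, ..., x_N (random, for illustration)
alpha = rng.normal(size=5)     # expansion coefficients alpha_i (random, for illustration)
x_new = rng.normal(size=2)     # a new input x

# Eq. (12): the feature u computed from kernel evaluations only.
u_kernel = sum(a_i * poly2_kernel(x_i, x_new) for a_i, x_i in zip(alpha, X))

# Eq. (11) followed by u = <a, phi(x)>, using the explicit feature map.
a = sum(a_i * poly2_features(x_i) for a_i, x_i in zip(alpha, X))
u_explicit = float(np.dot(a, poly2_features(x_new)))

# The Gram matrix of a Mercer kernel is symmetric with non-negative eigenvalues.
K = np.array([[poly2_kernel(xi, xj) for xj in X] for xi in X])
min_eig = np.linalg.eigvalsh(K).min()

print(abs(u_kernel - u_explicit), min_eig)
```

Both routes give the same u up to rounding, which is why the explicit form of the feature map is never needed once a valid kernel has been chosen.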

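Let us rewrite $\mathcal{L}$ in terms of the kernel. First,
let $\alpha = (\alpha_1, \ldots, \alpha_N)^T$,
$\beta = (\beta_1, \ldots, \beta_N)^T$, and define the
matrices

$$(K_x)_{ij} = k_x(x_i, x_j), \tag{13}$$
$$(K_y)_{ij} = k_y(y_i, y_j). \tag{14}$$

Then $\mathcal{L}$ is given by

$$\mathcal{L} = \alpha^T M \beta - \frac{\lambda_1}{2} \alpha^T L \alpha - \frac{\lambda_2}{2} \beta^T N \beta, \tag{15}$$

where

$$M = \frac{1}{N} K_x^T J K_y, \tag{16}$$
$$L = \frac{1}{N} K_x^T J K_x + \eta_1 K_x, \tag{17}$$
$$N = \frac{1}{N} K_y^T J K_y + \eta_2 K_y, \tag{18}$$
$$J = I - \frac{1}{N} \mathbf{1} \mathbf{1}^T, \tag{19}$$
$$\mathbf{1} = (1, \ldots, 1)^T, \tag{20}$$

and $\eta_1 = \eta / \lambda_1$, $\eta_2 = \eta / \lambda_2$.
In the factor 1/N, N denotes the number of training samples,
as in (9).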
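
The matrices (16)-(18) can be traced back to the expansion (11). The following is a sketch of that step, assuming the analogous expansion $b = \sum_i \beta_i \phi_y(y_i)$ for b and evaluating all expectations as averages over the N training samples. The vectors $\mathbf{u}$ and $\mathbf{v}$ of training features are notation introduced here for illustration.

```latex
\begin{align*}
\mathbf{u} &= K_x \alpha, \qquad \mathbf{v} = K_y \beta
  && \text{(training features, from (11) and (12))} \\
E[(u - E[u])(v - E[v])]
  &= \tfrac{1}{N} (J\mathbf{u})^T (J\mathbf{v})
   = \tfrac{1}{N} \alpha^T K_x^T J K_y \beta
   = \alpha^T M \beta
  && \text{(since } J^T J = J \text{)} \\
E[(u - E[u])^2] &= \tfrac{1}{N} \alpha^T K_x^T J K_x \alpha,
  \qquad \|a\|^2 = \alpha^T K_x \alpha \\
E[(u - E[u])^2] + \eta_1 \|a\|^2 &= \alpha^T L \alpha
\end{align*}
```

The same steps with y, b and $\eta_2$ give $\beta^T N \beta$.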

If $\eta > 0$ is satisfied, L and N are positive definite
almost surely, and we can show $\lambda_1 = \lambda_2 = \lambda$
from the constraint. As a result, we have a generalized
eigenvalue problem for $\alpha$, $\beta$:

$$M \beta = \lambda L \alpha, \tag{21}$$
$$M^T \alpha = \lambda N \beta. \tag{22}$$

It can be solved with a generalized eigenvalue problem
package or by Cholesky decomposition of L and N.
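
As an illustration of how (21) and (22) might be solved with a standard package, the following sketch stacks them into one symmetric generalized eigenvalue problem and calls `scipy.linalg.eigh`. Several things here are assumptions and not taken from the paper: $\eta_1$ and $\eta_2$ are treated as one fixed constant `eta`, a small `jitter` is added to the diagonal to keep the right-hand matrix numerically positive definite, and the toy data (points on a circle paired with points at twice the angle) are hypothetical and are not the data of the simulations below.

```python
import numpy as np
from scipy.linalg import eigh

def gaussian_gram(A, B, sigma):
    """Gram matrix with entries exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def kernel_cca(Kx, Ky, eta, n_components=2, jitter=1e-8):
    """Solve M beta = lam L alpha, M^T alpha = lam N beta as one symmetric problem."""
    n = Kx.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n                      # centering matrix, Eq. (19)
    M = Kx.T @ J @ Ky / n                                    # Eq. (16)
    L = Kx.T @ J @ Kx / n + eta * Kx + jitter * np.eye(n)    # Eq. (17), assuming eta_1 = eta
    Nm = Ky.T @ J @ Ky / n + eta * Ky + jitter * np.eye(n)   # Eq. (18), assuming eta_2 = eta
    Z = np.zeros((n, n))
    A = np.block([[Z, M], [M.T, Z]])                         # symmetric left-hand side
    B = np.block([[L, Z], [Z, Nm]])
    B = (B + B.T) / 2.0                                      # remove rounding asymmetry
    lam, V = eigh(A, B)                                      # eigenvalues in ascending order
    top = np.argsort(lam)[::-1][:n_components]               # keep the largest ones
    return lam[top], V[:n, top], V[n:, top]

def toy_pairs(n, rng, noise=0.05):
    """Hypothetical nonlinear pairs: x on a circle, y at twice the angle."""
    theta = rng.uniform(-np.pi, np.pi, size=n)
    X = np.c_[np.cos(theta), np.sin(theta)] + noise * rng.normal(size=(n, 2))
    Y = np.c_[np.cos(2 * theta), np.sin(2 * theta)] + noise * rng.normal(size=(n, 2))
    return X, Y

rng = np.random.default_rng(0)
X_tr, Y_tr = toy_pairs(100, rng)
X_te, Y_te = toy_pairs(100, rng)
sigma, eta = 1.0, 0.1
lam, alpha, beta = kernel_cca(gaussian_gram(X_tr, X_tr, sigma),
                              gaussian_gram(Y_tr, Y_tr, sigma), eta)

def first_corr(X, Y):
    """Empirical correlation of the first pair of features on the given samples."""
    u = gaussian_gram(X, X_tr, sigma) @ alpha[:, 0]
    v = gaussian_gram(Y, Y_tr, sigma) @ beta[:, 0]
    return np.corrcoef(u, v)[0, 1]

rho_train, rho_test = first_corr(X_tr, Y_tr), first_corr(X_te, Y_te)
print(f"first canonical correlation: train {rho_train:.2f}, test {rho_test:.2f}")
```

Stacking the two equations keeps the left-hand matrix symmetric and the right-hand matrix block diagonal and positive definite, which is the form `eigh` expects. New samples are projected using only kernel evaluations against the training samples, as in (12).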

4 Computer simulation 

4.1 Simulation 1 

We generate training samples and test samples independently
as follows. First, $\theta$ is generated from the uniform
distribution on $[-\pi, \pi]$, and then a pair of two
dimensional variables x and y are generated by

where $\epsilon_1$, $\epsilon_2$ are independent two
dimensional Gaussian noise with a standard deviation of 0.05.
We test with 40 training samples and 100 test samples. The
x-y scatter plot of (linear) CCA is shown in Fig. 3. The
correlation coefficients are as follows, where the values for
test samples are in parentheses.

        v1             v2
u1      0.71 (0.40)    0.00 (0.09)
u2      0.00 (0.00)    0.27 (0.19)



The x-y plot of kernel CCA is shown in Fig. 4. We used the
Gaussian kernel

$$k(x_1, x_2) = \exp\left( -\frac{\|x_1 - x_2\|^2}{2\sigma^2} \right) \tag{25}$$

both for x and y, where the parameters are set to
$\eta = 1.0$, $\sigma = 1.0$. The correlation coefficients are
as follows, where the values for test samples are in
parentheses.
        v1             v2
u1      0.98 (0.95)    0.00 (0.02)
u2      0.00 (0.02)    0.97 (0.93)



We show only up to the second components, though kernel CCA
yields higher components as well.

4.2 Simulation 2 

This section examines an artificial pattern recognition task
in the multimodal setting described at the beginning of the
paper.

Training samples x and y are generated randomly from the
uniform distribution on [0, 1]^ and paired at random. Each
pair of training samples represents a class center. Test
samples are generated by adding independent Gaussian noise
with standard deviation 0.05 to randomly chosen training
samples.

We test with 10 training samples (classes) and 100 test
samples.
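
The sampling scheme just described can be sketched as follows. This is a minimal illustration, not the code used for the experiments: the function name `make_simulation2` is hypothetical, and the dimensionality `dim` is an assumption (set to 2 here to match the two dimensional variables of Simulation 1), as is the choice to draw each test sample's class uniformly at random.

```python
import numpy as np

def make_simulation2(n_classes=10, n_test=100, dim=2, noise_std=0.05, seed=0):
    """Random class centers in [0, 1]^dim and noisy test samples around them."""
    rng = np.random.default_rng(seed)
    # Independent uniform draws for x and y; row k of each forms the k-th random pair.
    x_centers = rng.uniform(0.0, 1.0, size=(n_classes, dim))
    y_centers = rng.uniform(0.0, 1.0, size=(n_classes, dim))
    # Each test sample is a randomly chosen class center plus Gaussian noise.
    labels = rng.integers(0, n_classes, size=n_test)
    x_test = x_centers[labels] + noise_std * rng.normal(size=(n_test, dim))
    y_test = y_centers[labels] + noise_std * rng.normal(size=(n_test, dim))
    return x_centers, y_centers, x_test, y_test, labels

x_centers, y_centers, x_test, y_test, labels = make_simulation2()
print(x_centers.shape, x_test.shape, np.bincount(labels, minlength=10))
```

Because x and y centers are drawn independently, there is no functional relation between them beyond the pairing itself, which is what makes the task a test of association rather than of curve fitting.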

The x-y plot of the CCA result is shown in Fig. 5. The
correlation coefficients between features are as follows,
where the values for test samples are in parentheses.
Figure 3: Simulation 1. x-y plot for CCA. The numbers
represent the increasing order of training samples for $\theta$



The x-y plot of the kernel CCA result for the same dataset
is shown in Fig. 6.

We use the Gaussian kernel with parameters $\eta = 0.1$,
$\sigma = 0.1$. The correlation coefficients between features
are as follows, where the values for test samples are in
parentheses.
5 Concluding remarks

5.1 Kernel method and regularization 

We have proposed kernel canonical correlation analysis, in
which the kernel method is incorporated into canonical
correlation analysis. As in SVM, the point is nonlinearization
by the kernel method and avoidance of overfitting by a
regularization technique.

Figure 4: Simulation 1. x-y plot for kernel CCA

Figure 5: Simulation 2. x-y plot of CCA. The numbers
represent class centers

In general, it is important to determine the regularization
parameter. Moreover, the selection of the kernel form is
crucial for the performance. Although all parameters are
determined by hand in the simulations of this paper, we can
take more systematic approaches, such as resampling methods
like cross-validation, and empirical Bayes approaches. Such
techniques usually need iterative algorithms, which are time
consuming and are also likely to be trapped in a local
optimum. Examining such issues is future work.

As for the regularization term, we can use $\|\alpha\|^2$ and
$\|\beta\|^2$ instead of the quadratic regularization term
used in this paper. In the kernel discriminant analysis
described below, such a different type of regularization term
is used. The time complexities of both types are the same,
and empirically we were not able to find a significant
difference in performance. However, we may need more
realistic experiments.
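
For comparison, the alternative regularizer changes only the second term of the matrices L and N. A sketch, assuming the derivation otherwise proceeds exactly as for (17) and (18), with $I$ the N x N identity matrix:

```latex
\begin{align*}
\|a\|^2 + \|b\|^2 = \alpha^T K_x \alpha + \beta^T K_y \beta
  \quad &\longrightarrow \quad
  \|\alpha\|^2 + \|\beta\|^2 = \alpha^T \alpha + \beta^T \beta, \\
L = \tfrac{1}{N} K_x^T J K_x + \eta_1 K_x
  \quad &\longrightarrow \quad
  L = \tfrac{1}{N} K_x^T J K_x + \eta_1 I, \\
N = \tfrac{1}{N} K_y^T J K_y + \eta_2 K_y
  \quad &\longrightarrow \quad
  N = \tfrac{1}{N} K_y^T J K_y + \eta_2 I.
\end{align*}
```

Either way the result is a generalized eigenvalue problem of the same size, which is consistent with the equal time complexity of the two types.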



5.2 Relation to kernel discriminant 
analysis 

Canonical correlation analysis is closely related to
Fisher's discriminant analysis (FDA), which finds a mapping
that minimizes the within-class variance while maximizing the
between-class variance for effective pattern recognition. FDA
can be considered a special case of CCA. Mika et al. [5] have
proposed a kernel method for FDA, which is not strictly
included in kernel CCA because kernel FDA does not transform
the class label by a nonlinear mapping. In both kernel CCA
and kernel FDA, it is difficult to obtain a sparse
representation of the mapping. It would be a promising idea
to incorporate sparsity as a utility function.
Figure 6: Simulation 2. x-y plot of KCCA



5.3 Future issues from the informa- 
tion theory 

The author's group has proposed multimodal independent
component analysis (multimodal ICA), which extends CCA by
incorporating the information theoretic viewpoint [2]. The
transformation is restricted to be linear, and it has
sometimes been difficult to extract useful features from
nonlinearly related multivariates. Now we can raise a
question: can we integrate kernel CCA with multimodal ICA in
order to extract useful features?

The answer to this question depends on the properties of the
given data. If the noise level is low, as in the simulations
of this paper, the regularization constants are set to small
values and the correlation coefficients are expected to be
almost 1. We cannot expect multimodal ICA to improve the
performance in that case, because a correlation coefficient
close to 1 already achieves a large amount of mutual
information.
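
To make this concrete under a Gaussian assumption (which the kernel features need not satisfy, so the numbers are only indicative): for jointly Gaussian u and v with correlation coefficient $\rho$, the mutual information is

```latex
\[
I(u; v) = -\tfrac{1}{2} \log\left( 1 - \rho^2 \right),
\]
\[
\rho = 0.71:\ I \approx 0.35 \text{ nats}, \qquad
\rho = 0.98:\ I \approx 1.61 \text{ nats}, \qquad
\rho \to 1:\ I \to \infty .
\]
```

The two values of $\rho$ are the first training correlation coefficients reported for linear and kernel CCA in Simulation 1.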



On the other hand, when the noise level is large, multimodal
ICA possibly improves the performance. However, in such a
case, linear CCA is sometimes enough in practice. If we learn
a multi-valued function, as in the acquisition of multiple
concepts [1], it may be worth trying because the correlation
coefficients are small even if the noise level is low.

Let us consider further the case where the noise level is
low. In the simulation results of the previous section,
samples are mapped into a few clusters, which makes regression
between x and y difficult. In such a case, it is desirable
for the distribution of u and v to be scattered. From the
information theoretic viewpoint, it is preferable for the
feature space to have a large amount of entropy. Since the
distribution with the largest entropy under fixed average and
variance is Gaussian, Gaussianity can be used as the utility
function. For example, the third and fourth cumulants should
preferably be as small as possible. This seems opposite to
projection pursuit and independent component analysis, but
the difference may come from the difference in purpose: ICA
is for visualization, while our task is regression. The
assumption on the noise is also different. These issues are
related to the sparsity stated in the previous section, and
they are also future work.
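
One simple way to quantify the Gaussianity mentioned above is through the third and fourth cumulants of the standardized features. The sketch below is only an illustration of that idea: the helper name `standardized_cumulants` and the two example samples (a Gaussian sample and a two-cluster sample) are assumptions made here, not part of the proposed method.

```python
import numpy as np

def standardized_cumulants(u):
    """Third and fourth cumulants of a sample after scaling to zero mean, unit variance."""
    u = np.asarray(u, dtype=float)
    z = (u - u.mean()) / u.std()
    k3 = np.mean(z**3)          # third cumulant (skewness); 0 for a Gaussian
    k4 = np.mean(z**4) - 3.0    # fourth cumulant (excess kurtosis); 0 for a Gaussian
    return k3, k4

rng = np.random.default_rng(0)
scattered = rng.normal(size=100_000)      # Gaussian-like feature values
clustered = np.array([-1.0, 1.0] * 50)    # feature values collapsed onto two clusters

print("scattered:", standardized_cumulants(scattered))
print("clustered:", standardized_cumulants(clustered))
```

For the two-cluster sample the fourth cumulant is strongly negative, whereas for the Gaussian sample both cumulants are close to zero, so a penalty on their magnitudes would favor scattered features over clustered ones.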